arXiv: 1506.01418v2 [stat.ML] 28 Sep 2015 


Parallel Stochastic Gradient Markov Chain Monte Carlo for 
Matrix Factorisation Models 


Umut §im§ekli 1 
Hazal Koptagel 1 
Hakan Gulda ^ 1 
A. Taylan Cemgil 1 
Figen Oztoprak 2 
§. Ilker Birbil 3 


UMUT.SIMSEKLI@ BOUN.EDU.TR 
HAZAL.KOPTAGEL@ BOUN.EDU.TR 
HAKAN.GULDAS@BOUN.EDU.TR 
TAYLAN.CEMGIL@BOUN.EDU.TR 
FIGEN.OZTOPRAK@ BILGI.EDU.TR 
SIBIRBIL @ SABANCIUNIV.EDU 


1: Department of Computer Engineering, Bogazigi University, Istanbul, Turkey 
2: Department of Industrial Engineering, Bilgi University, Istanbul, Turkey 
3: Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey 


Abstract 

For large matrix factorisation problems, we de¬ 
velop a distributed Markov Chain Monte Carlo 
(MCMC) method based on stochastic gradient 
Langevin dynamics (SGLD) that we call Paral¬ 
lel SGLD (PSGLD). PSGLD has very favourable 
scaling properties with increasing data size and 
is comparable in terms of computational require¬ 
ments to optimisation methods based on stochas¬ 
tic gradient descent. PSGLD achieves high per¬ 
formance by exploiting the conditional indepen¬ 
dence structure of the MF models to sub-sample 
data in a systematic manner as to allow paralleli¬ 
sation and distributed computation. We provide 
a convergence proof of the algorithm and verify 
its superior performance on various architectures 
such as Graphics Processing Units, shared mem¬ 
ory multi-core systems and multi-computer clus¬ 
ters. 

1. Introduction 

Matrix factorisation (MF) models have been widely used in 
data analysis and have been shown to be useful in various 
domains, such as recommender systems, audio processing, 
finance, computer vision, and bioinformatics (Smaragdis 
& Brown, 2003; Devarajan, 2008; Cichoki et al., 2009). 
The aim of a MF model is to decompose an observed 
data matrix V E H 7x J in the form: V ~ WH, where 
W E IR 7xK and H E H 7CxJ are the factor matrices, 
known typically as the dictionary and the weight matrix re¬ 
spectively, to be estimated by minimising some error mea¬ 
sure such as the Frobenious norm || V — WH||i?. 

More general noise models and regularisation methods can 


be developed. One popular approach is using a probabilis¬ 
tic MF model having the following hierarchical generative 
model: 

P( W) = I ~[p(w ik ), p( H) = Y[p(h kj ) 

ik kj 

p(V|WH)=np(vy|w i ,h J -) (1) 

ij 

where, w i denotes the i th row of W and h j denotes the j th 
column of H 1 . In MF problems we might be interested in 
two different quantities: 

1. Point estimates such as the maximum likelihood (ML) 
or maximum a-posteriori (MAP): 

(W, H)* = arg max logp(W, H|V) (2) 

W.H 

2. The full posterior distribution: 

p( W, H| V) oc p(V |W, H)p(W)p(H) (3) 

The majority of the current literature on MF focuses on 
obtaining point estimates via optimisation of the objective 
given in Equation 2. Point estimates can be useful in prac¬ 
tical applications and there is a broad literature for solving 
this optimisation problem for a variety of choices of prior 
and likelihood functions, with various theoretical guaran¬ 
tees (Lee & Seung, 1999; Liu et al., 2010; Fevotte & Idier, 
2011; Gemulla et al., 2011; Recht & Re, 2013). In con¬ 
trast, Monte Carlo methods that sample from the often in¬ 
tractable full posterior distribution (in the sense of com¬ 
puting moments or the normalizing constant) received less 

'in the rest of the paper, we will use bold capital letters to 
denote matrices, e.g., A, bold small letters to denote vectors, e.g., 
a, and small regular letters to denote scalars, e.g., a. 
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attention, mainly due to the perceived computational ob¬ 
stacles and rather slow convergence of standard methods, 
such as the Gibbs sampler, for the target density in 3. 

Having an efficient sampler that can generate from the full 
posterior in contrast to a point estimate would be useful in 
various applications such as model selection (i.e., estimat¬ 
ing the ‘rank’ K of the model) or estimating the Bayesian 
predictive densities useful for active learning. Yet, despite 
the well known advantages, Monte Carlo methods are typ¬ 
ically not the method choice in large scale MF problems as 
they are perceived to be computationally very demanding. 
Indeed, classical approaches based on batch Metropolis- 
Hastings would require passing over the whole data set at 
each iteration and the acceptance step makes the methods 
even more impractical for large data sets. Recently, alter¬ 
native approaches have been proposed to scale-up MCMC 
inference to large-scale regime. An important attempt was 
made by Welling and Teh (2011), where the authors com¬ 
bined the ideas from a gradient-based MCMC method, 
so called the Langevin dynamics (LD) (Neal, 2010) and 
the popular optimisation method, stochastic gradient de¬ 
scent (SGD) (Kushner & Yin, 2003), and developed a scal¬ 
able MCMC framework called as the stochastic gradient 
Langevin dynamics (SGLD). Unlike conventional batch 
MCMC methods, SGLD requires to ‘see’ only a small sub¬ 
set of the data per iteration similar to SGD. With this man¬ 
ner, SGLD can handle large datasets while at the same time 
being a valid MCMC method that forms a Markov Chain 
asymptotically sampling from the target density. Approxi¬ 
mation analysis of SGLD has been studied in (Sato & Nak- 
agawa, 2014) and (Teh et al., 2014). Several extensions 
of SGLD have been proposed. Ahn et al. (2012) made 
use of the Fisher information besides the noisy gradients, 
Patterson and Teh (2013) applied SGLD on the probability 
simplex. Chen et al. (2014) and Ding et al. (2014) consid¬ 
ered second order Langevin dynamics and made use of the 
momentum terms, extending the vanilla SGLD. 

In this study, we develop a parallel and distributed MCMC 
method for sampling from the full posterior of a broad 
range of MF models, including models not easily tackled 
using standard methods such as the Gibbs sampler. Our 
approach is carefully designed for MF models and builds 
upon the generic distributed SGLD (DSGLD) framework 
that was proposed in (S. Ahn & Welling, 2014) where sep¬ 
arate Markov chains are run in parallel on different subsets 
of the data that are distributed among worker nodes. When 
applied to MF models, DSGLD results in computational 
inefficiencies since it cannot exploit the conditional inde¬ 
pendence structure of the MF models. Besides, DSGLD 
requires all the latent variables (i.e., W and H) to be syn¬ 
chronised once in a couple of iterations which introduces 
a significant amount of communication cost. On the other 
hand, for large problems it may not even be possible to 


store the latent variables in a single machine; one needs to 
distribute the latent variables among the nodes as well. 

We propose a novel parallel and distributed variant of 
SGLD for MF models, that we call Parallel SGLD (PS- 
GLD). PSGLD has very favourable scaling properties with 
increasing data size, remarkably upto the point that the re¬ 
sulting sampler is computationally not much more demand¬ 
ing than an optimisation method such as the distributed 
stochastic gradient descent (DSGD) (Gemulla et al., 2011). 
Reminisicent to DSGD, PSGLD achieves high perfor¬ 
mance by exploiting the conditional independence struc¬ 
ture of the MF models for sub-sampling the data in a sys¬ 
tematic manner as to allow parallelisation. The main ad¬ 
vantages of PSGLD can be summarised as follows: 

• Due to its inherently parallel structure, PSGLD is faster 
than SGLD by several orders of magnitude while being 
as accurate. 

• As we will illustrate in our experiments, PSGLD can 
easily be implemented in both shared-memory and dis¬ 
tributed architectures. This makes the method suitable 
for very large data sets that might be distributed among 
many nodes. 

• Unlike DSGLD, which requires to communicate all the 
parameters W and H among the worker nodes, PSGLD 
communicates only small parts of H. This drastically 
reduces the communication cost for large W and H. 

• The probability distribution of the samples generated by 
PSGLD converges to the Bayesian posterior. 

We evaluate PSGLD on both synthetic and real datasets. 
Our experiments show that, PSGLD can be beneficial in 
two different settings: 1) a shared-memory setting, where 
we implement PSGLD on a graphics processing unit (GPU) 
2) a distributed setting, where we implement PSGLD on a 
cluster of computers by using a message passing protocol. 
Our results show that, in the shared-memory setting, while 
achieving the same quality, PSGLD is 700+ times faster 
than a Gibbs sampler on a non-negative matrix factorisa¬ 
tion problem; and in the distributed setting, PSGLD easily 
scales-up to matrices with hundreds of millions of entries. 

We would like to note that, a DSGLD-based, distributed 
MF framework has been independently proposed by Ahn 
et al. (2015), where the authors focus on a particular MF 
model, called as the Bayesian probabilistic matrix factori¬ 
sation (BPMF) (Salakhutdinov & Mnih, 2008). In this 
study, we focus on a generalised observation model family 
(Tweedie models), in which we can obtain several observa¬ 
tion models that have been used in important MF models 
(such as BPMF, Poisson non-negative matrix factorisation 
(NMF) (Lee & Seung, 1999), Itakura-Saito NMF (Fevotte 
et al., 2009)) as special cases. 
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Figure 1 . Illustration of the parts and the blocks. Here, we partition the sets [I] and [J] into B — 3 pieces. The partitions for this example 
are chosen as Vb([I]) = {{1, 

1 ,. 


§},{§ + !,...,f},{f + l,...,J}}andP B ([J]) = + 

, J}}. Part 1 consists of three non-overlapping blocks, say n = Ai U A2 U A3, where Ai — { 1 ,..., 3} x { 1 ,..., ^}, A2 = 
+ f},andA 3 = {f+ 1,x{f+l, 


3 j ~ r-*-5 • • • 5 3 j 

., J}. Given the blocks in a part, the corresponding 


blocks in W and H become conditionally independent, as illustrated in different colours and textures. Therefore, for different blocks, 
the PSGLD updates can be applied in parallel. 


2. Stochastic Gradient Langevin Dynamics 
(SGLD) for Matrix Factorisation 

In the last decade, SGD has become very popular due to 
its low computational requirements and convergence guar¬ 
antee. SGLD brings the ideas of SGD and LD together in 
order to generate samples from the posterior distribution 
in a computationally efficient way. In algorithmic sense, 
SGLD is identical to SGD except that it injects a Gaussian 
noise at each iteration. For MF models, SGLD iteratively 
applies the following update rules in order to obtain the 
samples W® and H®: + AW^ and 

H (t) = h^- 1 ) + AHW, where 

AW (I) =e (t) (i V w logp(n;j|W^ _1 \ H (t_1) ) 

+ Vwlogp(W (t " 1 >)) + * (t) 

AH( f ) =e (t) (y^L y^Vnlogp^lW^- 1 ),^- 1 )) 
'(ij)enm 

+ V H logp(H( t - 1 ))) +S (t) . 

Here, N is the number of elements in V, t = 1,..., T 
denotes the iteration number, C [I] x [J] is the sub¬ 
sample that is drawn at iteration t, the set [I] is defined as 
[I] = {1V denotes the gradients, and |fiW | de¬ 
notes the number of elements in The elements of the 
noise matrices ^ (t) and S (t) are independently Gaussian 
distributed: 

4k ~ 0, 2e«), eg - W(eg; 0,2e«). 

For convergence, the step size must satisfy the follow¬ 
ing conditions: 

00 00 

= oc, ) 2 < oc (4) 

t=o t=o 

A typical choice for the step size is = 0(l/t). 

In SGLD, the sub-sample can be drawn with or with¬ 
out replacement. When dealing with MF models, instead of 


sub-sampling the data arbitrarily, one might come up with 
more clever sub-sampling schemas that could reduce the 
computational burden drastically by enabling parallelism. 
In the next section, we will describe our novel method, PS¬ 
GLD, where we utilise a systematic sub-sampling schema 
by exploiting the conditional independence structure of MF 
models. 

3. Parallel SGLD for Matrix Factorisation 

In this section, we describe the details of PSGLD. In¬ 
spired by (Liu et al., 2010; Gemulla et al., 2011; Recht & 
Re, 2013), PSGLD utilises a biased sub-sampling schema 
where the observed data is carefully partitioned into mu¬ 
tually disjoint blocks and the latent factors are also parti¬ 
tioned accordingly. An illustration of this approach is de¬ 
picted in Figure 1. In this particular example, the observed 
matrix V is partitioned into 3x3 disjoint blocks and the la¬ 
tent factors W and H are partitioned accordingly into 3x1 
and 1x3 blocks. At each iteration, PSGLD sub-samples 3 
blocks from V, called as the parts , in such a way that the 
blocks in a part will not ‘touch’ each other in any dimension 
of V, as illustrated in Figure 1. This biased sub-sampling 
schema enables parallelism, since given a part, the SGLD 
updates can be applied to different blocks of the latent fac¬ 
tors in parallel. 

In the example given in Figure 1, we arbitrarily partition 
the data into 9 equal-sized blocks where these blocks are 
obtained in a straightforward manner by partitioning V us¬ 
ing a 3 x 3 grid. In the general case, the observed matrix 
V will be partitioned into B x B = B 2 blocks and these 
blocks can be formed in a data-dependent manner, instead 
of using simple grids. 

Let us formally define a block and a part. First, we need 
to define a partition of a set S as Vb{S) where Vb(S) 
contains non-empty disjoint subsets of S , whose union is 
equal to S. Here, B denotes the number of subsets that 
the partition V contains. We will define the blocks and the 
parts by using partitions of the sets [I] and [J]. 

Definition 1. A block , Ac [I] x [J] is the Cartesian prod- 
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uct of two sets, one of them being in Vb([I]) and the other 
one being in Vb([J])- Formally, it is defined as follows: 

A =lxj (5) 

where X G Vb {[/]) and J G Pb([J])- 

Definition 2. A part, C [I] H [J] at iteration t, is 
a collection of mutually disjoint blocks and is defined as 
follows: 

n (t) = uf =1 Af = uf =1 jf x (6) 

where all the blocks are mutually disjoint, formally, 

4 t] e Pb([I\), Jl t] e Vb([J}) 

i ( y n i ( y = 0, n = 0, V6 f b'. 

Suppose we read a part IIW = U^ =1 A^ at iteration t. 
Then the SGLD updates for W can be written as follows: 

AWW =eW (m E Vwlogp(^l-) 

+ V w logp(W (< - 1) )) +^ (t) 

=eW (m^E E Vwlo §^ii-) 

6=1 (idjeA**) 

+ V w logpfW* 1 - 1 ))) + * (t) (7) 

Since all are mutually disjoint, we can decompose 
Equation 7 into B interchangeable updates (i.e., they can 
be applied in any order), that are given as follows: = 

w£ t_1) + Awf, where 

AW« =e W (^L 

• Vwjogpiw;;- 1 ')) I <’ (8) 

for all b = 1,..., B. Here, and are the latent 
factor blocks at iteration t, that are determined by the cur¬ 
rent data block x and are formally defined 

as follows: 

ie4*\ke[K\} 

jejf\ke[K]} 

The noise matrix is of the same size as and its 
entries are independently Gaussian distributed with mean 0 
and variance 2e^\ 


Similarly, we obtain B interchangeable update rules for H 
that are given as follows: + AH^, where 

AH ^ } =e(t) (\m E^iogpK-iw^.H^i)) 
(ij)e A® 

+ V H >gp(H< t - 1) ))+E< t) (9) 

for all b = 1,..., B. Similarly, is of the same size as 
H 5 and its entries are independently Gaussian distributed 
with mean 0 and variance 2e^. The parallelism of PSGLD 
comes from the fact that all these B update rules are in¬ 
terchangeable, so that we can apply them in parallel. The 
pseudo-code of PSGLD is given in Algorithm 1. 

3.1. Convergence Analysis 

Since we are making use of a biased sub-sampling schema, 
it is not clear that the samples generated by PSGLD will 
converge to the Bayesian posterior. In this section, we will 
define certain conditions on the selection of the parts and 
provided these conditions hold, we will show that the prob¬ 
ability distribution of the samples and converges 
to the Bayesian posterior p(W, H|V). 

Lor theoretical use, we define 0 as the parameter vector, 
that contains both W and H: 

6 = [vec(W) T , vec(H) T ] T (10) 

where vec(-) denotes the vectorisation operator. We also 
define 

c(o^)= iogp(# (t) ) + logptu^j#^) 

£(0 {t) ) = log p(6 {t) ) + -X- logp(%|0 (t) ) 

' ' i,je n(*) 

Then, the stochastic noise is given by 

C (t) = V e £(0 (t) ) - Ve£(0 w ). (11) 

Under the following conditions Theorem 1 holds. 
Condition 1. The step size satisfies Equation 4. 

Condition 2. The part is chosen from B nonoverlap¬ 
ping parts whose union covers the whole observed matrix 
V (e.g., the parts given in Figure 1). The probability of 
choosing a part at iteration t is proportional to its 
size: 

p( n (t) =n) = ^l. 

Condition 3. E[(£®) fe ] < oc, for integer k > 2. 
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Algorithm 1: PSGLD for matrix factorisation. 

Input: V, Wf°), H(°), T, B, V B {[I]), V B {[J\) 

Output: {WW,HW}L 
for t <— 1 to T do 
Set the step size 


Pick a part 




r= lAb 

for each block in II 1 ' do in parallel 

= wf _1) 

_ — 1) 


Awf 


■ ah[ () 


//(Eq. 8) 
//{Eq. 9) 

/* Optional mirroring step for non-negativity. 

| • | is element-wise absolute value operator. */ 

wf <- |W, 


H 


W 


<- IH 


Wi 
b I 
W| 


end 


end 


Theorem 1. Let q t (0) be the probability density function of 
the samples 6^ that are generated by PSGLD. Then , the 
probability distribution of 0^ converges to the Bayesian 
posterior p(0\V): 


lim q t (0)=p(0\V). (12) 

t—yoo 

Proof sketch. 

Under Condition 2, we can show that t is an unbiased esti¬ 
mator of C\ therefore the stochastic noise is zero-mean: 

E[C (t) ] = 0. 

The rest of the proof is similar to (Sato & Nakagawa, 2014). 
Under conditions 1 and 3, we can show that qt(0) fol¬ 
lows the (multi-dimensional) Fokker-Plank equation and 
therefore the stationary distribution of qt(0) is p(0\ V) oc 
exp(— C(0)). □ 


3.2. Non-negativity Constraints 

In certain applications, all the elements of V, W, and H 
are required to be non-negative, that is known as the non¬ 
negative matrix factorisation (NMF) (Lee & Seung, 1999). 
As we will illustrate in Section 4, the non-negativity con¬ 
straint is often a necessity in certain probabilistic mod¬ 
els, where we essentially decompose the parameters of 
the probabilistic model that are non-negative by definition 
(e.g., the intensity of a Poisson distribution or the mean of 
a gamma distribution). 

In an SGD framework, the latent factors can be kept in a 
constraint set by using projections that apply the minimum 
force to keep the variables in the constraint set. However, 
since we are in an MCMC framework, it is not clear that 
appending a projection step to the PSGLD updates would 


still result in a proper MCMC method. Instead, similar to 
(Patterson & Teh, 2013), we make use of a simple mirroring 
trick, where we replace the negative entries of W® and 
with their absolute values. Formally, we let and 
hkj take values in the whole H, however we parametrise the 
prior and the observation models with the absolute values, 
\wik\ and \hkj\. Since and -w^ (similarly, and 
—h^j) will be equiprobable in this setting, we can replace 
the negative elements of W® and with their absolute 
values without violating the convergence guarantee. 


4. Experiments 

In this section we will present our experiments where we 
evaluate PSGLD on both synthetic and real datasets using 
the non-negative matrix factorisation (NMF) model. In or¬ 
der to be able to cover a wide range of likelihood functions, 
we consider the following probabilistic model: 


Pi W) = n £{w ik -\ w ), p(H) = n £{h kj -\ h ) 

ik kj 

p(V|WH) = Y[TW(v ij -,y2 w *hk j ,<f>,P) (13) 

ij k 


where V e IR+ X J , W e ]R+ xX , and H e R+ x J . Here, £ 
and TW denote the exponential and Tweedie distributions, 
respectively. The Tweedie distribution is an important spe¬ 
cial case of the exponential dispersion models (Jprgensen, 
1997) and has shown to be useful for factorisation models 
(Yilmaz et al., 2011). The Tweedie density can be written 
in the following form: 

s) «*(-£<«*)) 


where fi is the mean, <f is the dispersion (related to the vari¬ 
ance), (3 is the power parameter, K(-) is the normalizing 
constant, and dp(-) denotes the ^-divergence that is defined 
as follows: 




V 13 1 fJ,P 

/?(/?-1) 


The /3-divergence generalises many divergence functions 
that are commonly used in practice. As special cases, 
we obtain the Itakura-Saito divergence, Kullback-Leibler 
divergence, and the Euclidean distance square, for (3 = 
0,1,2, respectively. From the probabilistic perspective, dif¬ 
ferent choices of (3 yield important distributions such as 
gamma (/? = 0), Poisson (/3 = 1), Gaussian (/? = 2), 
compound Poisson (0 < (3 < 1), and inverse Gaussian 
(/3 = —1) distributions. Due to a technical condition, no 
Tweedie model exists for the interval 1 < (3 < 2, but for 
all other values of /3, one obtains the very rich family of 
Tweedie stable distributions (Jprgensen, 1997). Thanks to 
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the flexibility of the Tweedie distribution, we are able to 
choose an observation model by changing a single param¬ 
eter /?, without modifying the inference algorithm. 

In most of the special cases of the Tweedie distribution, the 
normalizing constant Kp) is an infinite sum and cannot 
be written in a simple analytical form. Fortunately, pro¬ 
vided that p and (3 are given, the normalizing constant be¬ 
comes irrelevant since it does not depend on the mean pa¬ 
rameter n and therefore W and H. Consequently, the PS- 
GLD updates only involve the partial derivatives of the /?- 
divergence with respect to Wi k and h k j , which is tractable. 

4.1. Experimental Setup 

We will compare PSGLD with different MCMC methods, 
namely the Gibbs sampler, LD, and SGLD. We will con¬ 
duct our experiments in two different settings: 1) a shared- 
memory setting where the computation is done on a single 
multicore computer 2) a distributed setting where we make 
use of a cluster of computing nodes 2 . 

It is easy to derive the update equations required by 
the gradient-based methods for the Tweedie-NMF model. 
However, developing a Gibbs sampler for this general 
model is unfortunately not obvious. We could derive Gibbs 
samplers for certain special cases of the Tweedie model, 
such as the Poisson-NMF (Cemgil, 2009) where (3 = 1 and 
p = 1. Moreover, in order the full conditional distribu¬ 
tions that are required by the Gibbs sampler, we need to 
introduce an auxiliary tensor and augment the probabilistic 
model in Equation 13 as follows: 

p(w ik ) = £{w ik ; \ w ), p(h kj ) = £(h kj ; X h ) 

P^Sijk') — VO(Sij k , Wi k h k j\ Vij — ^ ^ Sjjk 

k 

where VO denotes the Poisson distribution. 

The LD and Gibbs samplers require to pass on the whole 
observed matrix V at every iteration. The Gibbs sampler 
further requires the whole auxiliary tensor S = {sij k } G 
I /xJ x K t 0 b e sampled at each iteration. 

4.2. Shared-Memory Setting 

In this section, we will compare the mixing rates and the 
computation times of all the aforementioned methods in a 
shared-memory setting. We will first compare the methods 
on synthetic data, then on musical audio data. 

We conduct all the shared-memory experiments on a Mac- 
Book Pro with 2.5GHz Quad-core Intel Core i7 CPU, 16 
GB of memory, and NVIDIA GeForce GT 750M graphics 
card. We have implemented PSLGD on the GPU in CUDA 

2 For the source code for both settings (CUDA and OpenMPI), 
please contact the authors. 


C. We have implemented the other methods on the CPU 
in C, where we have used the GNU Scientific Library and 
BLAS for the matrix operations. 

4.2.1. Experiments on Synthetic Data 

In order to be able to compare all the methods, in our 
first experiment we use the Poisson-NMF model. We first 
generate W, H, and V by using the generative model. 
Then, we run all the methods in order to obtain the sam¬ 
ples For simplicity, we choose I = J 

and we set K = 32. In order to obtain the blocks, we parti¬ 
tion the sets [I] and [J] into B = 1/32 equal pieces, where 
we simply partition V by using a B x B grid, similar to 
the example given in Figure 1. Initially, we choose B dif¬ 
ferent parts whose union cover the whole observed matrix 
V, similar to the ones in Figure 1. At each iteration, we 
choose one of these parts in cyclic order, i.e. we proceed to 
the next part at each iteration and return the first part after 
iteration Bk with integer k > 1. Since the sizes of all the 
parts are the same, Condition 2 is satisfied. 

In LD, we use a constant step size e, whereas in SGLD 
and PSGLD, we set the step sizes as = ( a/t) b , where 
b G (0.5,1]. For each method, we tried several values for 
the parameters and report the results for the best perform¬ 
ing ones. In LD we set e = 0.2, in SGLD we set a = 1, 
b = 0.51, and in PSGLD we set a = 0.01 and b = 0.51. 
The results are not very sensitive to the actual value of a 
and b , provided these are set in a reasonable range. Fur¬ 
thermore, in SGLD, we draw the sub-samples with a 
with-replacement manner, where we set \Cl^ | = IJ/ 32. 

Figure 2(a) shows the mixing rates and the running times 
of the methods under the Poisson model for different data 
sizes. While plotting the log-likelihood of the state of the 
Markov chain is not necessarily an indication of conver¬ 
gence to the stationary distribution, nevertheless provides 
a simple indicator if the sampler is stuck around a low 
probability mode. We set the number of rows / = 256, 
512, 1024 and we generate T = 10000 samples from the 
Markov chain with each method. We can observe that, in 
all cases, SGLD achieves poor mixing rates due to the with- 
replacement sub-sampling schema while LD achieves bet¬ 
ter mixing rates than SGLD. Moreover, while the LD up¬ 
dates can be implemented highly efficiently using BLAS, 
the reduced data access of SGLD does not reflect in re¬ 
duced computation time due to the random data access pat¬ 
tern when selecting sub-samples from V. 

The results show that PSGLD and the Gibbs sampler seem 
to achieve much better mixing rates. However, we observe 
an enormous difference in the running times of these meth¬ 
ods - PSGLD is 700+ times faster than the Gibbs sampler 
on a GPU, while achieving virtually the same quality. For 
example, in a model with / = 1024 rows, the Gibbs sam- 
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(a) 


(b) 


Figure 2. Shared-memory experiments with a) the Poisson observation model b) the compound Poisson observation model. 


pier runs for more than 3 hours while PSGLD completes 
the burn-in phase in nearly 1 second and generates 10K 
samples from the Markov chain in less than 15 seconds, 
even when there are more than 1 million entries in V. Nat¬ 
urally, this gap between PSGLD and the Gibbs sampler be¬ 
comes more pronounced with increasing problem size. We 
also observe that PSGLD is faster than LD and SGLD by 
60+ folds while achieving a much better mixing rate. 

We also evaluate PSGLD with TW(r;/i,^ = 1.(3 = 
0.5) observation model, which corresponds to a compound 
Poisson distribution. This distribution is particularly suited 
for sparse data as it has a non-zero probability mass on 
v = 0 and a continuous density on v > 0 (Jprgensen, 
1997). Even though the probability density function of this 
distribution cannot be written in closed-form analytical ex¬ 
pression, fortunately we can still generate random samples 
from the distribution in order to obtain synthetic V. 

Since deriving a Gibbs sampler for the compound Poisson 
model is not obvious, we will compare only LD, SGLD, 
and PSGLD on this model. Figure 2(b) shows the perfor¬ 
mance of these methods for I = J = 1024. We obtain 
qualitatively similar results; PSGLD achieves a much bet¬ 
ter mixing rate and is much faster than the other methods. 

4.2.2. Experiments on Audio 

The Tweedie-NMF model has been widely used for audio 
and music modelling (Fevotte & Idier, 2011). In musical 
audio context, the observed matrix V is taken as a time- 
frequency representation of the audio signal, such as the 
power or magnitude spectra that are computed via short- 
time Fourier transform. Here, the index i denotes the fre¬ 
quency bins , whereas the index j denotes the time-frames. 
An example audio spectrum belonging to a short piano ex¬ 
cerpt (5 seconds) is given in Figure 3(a). 

When the audio spectrum V is decomposed by using an 
NMF model, each column of W will contain a different 
spectral template and each row of H will contain the acti¬ 
vations through time for a particular spectral template. In 
music processing applications, each spectral template is ex¬ 
pected to capture the spectral shape of a certain musical 



Figure 4. Illustration of the communication mechanism. There 
are 3 nodes and B — 3 is selected as the number of nodes. The 
numbers inside the blocks denote the nodes in which the corre¬ 
sponding blocks are located. At each iteration, node n transfers 
its H b block to node (n mod B) + 1. The blocks W b are kept 
in the same node throughout the process. This strategy implicitly 
determines the part to be used at the next iteration. 

note and the activations are expected to capture the loud¬ 
ness of the notes. 

We decompose the audio spectrum given in Figure 3(a) and 
visually compare the dictionary matrices that are learned by 
LD and PSGLD. The size of V is I = J = 256 and we set 
K = 8. For PSGLD, we partition the sets [ I] and [J] into 
B = 8 equal pieces and we choose the parts in cyclic order 
at each iteration. With each method, we generate 10000 
samples but discard the samples in the burn-in phase (5000 
samples). Figure 3(b) shows the Monte Carlo averages that 
are obtained by different methods. We observe that PSGLD 
successfully captures the spectral shapes of the different 
notes and the chords that occur in the piece, even though 
the method is completely unsupervised. We also observe 
that LD is able to capture the spectral shapes of most of the 
notes as well, and estimates a less sparse dictionary. Fur¬ 
thermore, PSGLD runs in a much smaller amount of time; 
the running times of the methods are 3.5 and 81 seconds 
respectively for PSGLD and LD - as a reference the Gibbs 
sampler needs to run for 533 seconds on the same problem. 

4.3. Distributed-Hybrid Setting 

In this section, we will focus on the implementation of PS¬ 
GLD in a distributed setting, where each block of V might 
reside at a different node. We will consider a distributed 
architecture that contains three main components: 1) the 
data nodes that store the blocks of V 2) the computational 
nodes that execute the PSGLD updates 3) the main node 
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Figure 3. a) The audio spectrum of a short piano piece b) The spectral dictionaries learned by PSGLD and LD. 
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Figure 5. RMSE values on MovieLens 10M dataset. 

that is only responsible for submitting the jobs to the com¬ 
putational nodes only at the beginning of the sampling pro¬ 
cess. 

In the distributed setting, we implement PSGLD by a mes¬ 
sage passing protocol in C using the OpenMPI library. PS¬ 
GLD is naturally suited for message passing environments, 
and the low level control on the distributed computations 
provide more insight than other platforms such as Hadoop 
MapReduce. On the other hand, it is straightforward to 
implement PSGLD in a MapReduce environment for com¬ 
mercial and fault-tolerant applications. 

In our implementation, we make use of an efficient com¬ 
munication mechanism, where we set the number of blocks 
B to the number of available nodes. As illustrated in Fig¬ 
ure 4, throughout the sampling process, each node is re¬ 
sponsible only for a certain W 5 block; however, at the end 
of each iteration it transfers the corresponding H 5 block to 
its adjacent node in a cyclic fashion. With this mechanism, 
the part 11 ^ is determined implicitly at each iteration de¬ 
pending on the current locations of the factor blocks W& 
and H 5 . Besides, as opposed to many distributed MCMC 
methods such as DSGLD, this mechanism enables PSGLD 
to have a much lower communication cost, especially for 
large /, J, and B values. 

We conduct our distributed-setting experiments on a clus¬ 
ter with 15 computational nodes where each computational 
node has 8 Intel Xeon 2.50GHz CPUs and 16 GB of mem¬ 
ory. Therefore, provided that the memory is sufficient, we 
are able to run 120 concurrent processes on our cluster. In 
our experiments, by assuming that the network connection 
between the computational nodes is sufficiently fast, we 
will assume that we have at most 120 computational nodes. 

We evaluate PSGLD on a large movie ratings dataset, 


MovieLens 10M (grouplens . org). This dataset con¬ 
tains 10 million ratings applied to I = 10681 movies by 
J = 71567 users, resulting in a sparse V where 1.3% of 
V is non-zero. In all our experiments, we set K = 50, 
/3 = 4> = 1, and we set B to the number of available nodes 
where we partition the sets [I] and [J] into B equal pieces 
similar to the shared-memory experiments. In these exper¬ 
iments, the sizes of the parts are close to each other, there¬ 
fore our communication mechanism satisfies Condition 2. 

In our first experiment, our goal is to contrast the speed of 
our sampling algorithm to a distributed optimisation algo¬ 
rithm. Clearly, the goals of both computations are differ¬ 
ent (a sampler does not solve an optimisation problem un¬ 
less techniques such as simulated annealing is being used), 
yet monitoring the root mean squared error (RMSE) be¬ 
tween V and WH throughout the iterations provides a 
qualitative picture about the convergence behaviour of the 
algorithms. Figure 5 shows the RMSE values of PSGLD 
and the distributed stochastic gradient descent (DSGD) al¬ 
gorithm (Gemulla et al., 2011) for 1000 iterations with 
B = 15. We observe that a very similar convergence be¬ 
haviour and the running times for both methods. The re¬ 
sults indicate that, PSGLD makes Bayesian inference pos¬ 
sible for MF models even for large datasets by generating 
samples from the Bayesian posterior, while at the same 
time being as fast as the state-of-the-art distributed opti¬ 
misation algorithms. 

In our last set of experiments, we demonstrate the scalabil¬ 
ity of PSGLD. Firstly, we differ the number of nodes from 
5 to 120 and generate 100 samples in each setting. Fig¬ 
ure 6 (a) shows the running times of PSGLD for different 
number of nodes. The results show that, the running time 
reduces almost quadratically as we increase the number of 
nodes until B = 90. For B = 120, the communication cost 
dominates and the running time increases. 

Finally, in order to illustrate how PSGLD scales with the 
size of the data, we increase the size of V while increas¬ 
ing the number of nodes accordingly. We start with the 
original dataset and 15 nodes, then we duplicate V in both 
dimensions (the number of elements quadruples) and set 
the number of nodes to 30. We repeat this procedure two 
more times, where the ultimate dataset becomes of size 
683.584 x 4.580.288 with 640 million non-zero entries and 
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Figure 6. Scalability of PSGLD. a) The size of the data is kept 
fixed, the number of nodes is increased b) The size of the data and 
the number of nodes are increased proportionally. 

the number of nodes becomes 120. Figure 6(b) shows the 
running times of PSGLD with T = 10 for increasing data 
sizes and number of nodes. The results show that, even 
though we increase the size of the data 64 folds, the run¬ 
ning time of PSGLD remains nearly constant provided we 
can increase the number of nodes proportionally. 

5. Conclusion 

We have described a scalable MCMC method for sampling 
from the posterior distribution of a MF model and tested the 
performance of our approach in terms of accuracy, speed 
and scalability on various modern architectures. Our re¬ 
sults suggest that, contrary to the established folklore in 
ML, inference methods for ‘big data’ are not limited to op¬ 
timisation, and Monte Carlo methods are as competitive 
in this regime as well. The existence of efficient samplers 
paves the way to full Bayesian inference; due to lack of 
space we have not presented natural applications such as 
model selection. 

We conclude with the remark that it is rather straightfor¬ 
ward to extend PSGLD to more structured models such as 
coupled matrix and tensor factorisation models. Here, sev¬ 
eral datasets are decomposed simultaneously and the dis¬ 
tributed nature of PSGLD is arguably even more attractive 
when data are naturally distributed to different physical lo¬ 
cations. 
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